-
Notifications
You must be signed in to change notification settings - Fork 0
Detect and recover from master worker death during setup #91
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Draft
ssteele110
wants to merge
11
commits into
master
Choose a base branch
from
sammyst/master-setup-heartbeat
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
+219
−21
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
7ac3885 to
8bb5884
Compare
When the master worker is killed before completing queue setup, the master-status stays at 'setup' and all non-master workers poll for ~120s then fail with LostMaster. This change adds a master setup heartbeat mechanism that allows workers to detect a dead master and atomically re-elect a new one. Changes: - Add master_setup_heartbeat_interval (5s) and master_setup_heartbeat_timeout (30s) config options - Master sends heartbeat during setup via background thread - Workers detect stale heartbeat and attempt atomic takeover via Lua script - Guard in push() prevents split-brain (old master pushing after replacement) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The previous guard check in push() had a race window between checking master-worker-id and executing the transaction. A takeover could occur in that window, causing both old and new master to push tests. Use Redis WATCH to monitor master-worker-id. If it changes between WATCH and MULTI execution, the transaction automatically aborts, preventing duplicate test pushes. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Previously, there was a window between acquiring master role and entering with_master_setup_heartbeat where no heartbeat existed. Other workers could see status='setup' with no heartbeat and incorrectly attempt takeover. Now the initial heartbeat is set immediately after acquiring master role, closing this window. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…keover After taking over as master, check if master-status is still "setup" before running setup. If status is already "ready", the original master completed its push before we could act, so we skip to avoid duplicate test pushes. This fixes the race where: 1. Original master's push() passes WATCH check 2. Heartbeat becomes stale during redis.multi execution 3. Worker takes over (status still "setup") 4. Original master's transaction commits, sets status="ready" 5. Without this fix, takeover worker would also push tests Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…after takeover" This reverts commit 636796e.
Extract shared master setup logic (reorder tests, store chunk metadata, push to queue) into execute_master_setup method. Used by both initial master setup in populate() and takeover in run_master_setup(). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
8bb5884 to
ddfef01
Compare
The respond_to? check was looking for :run_master_setup but the actual method is named :execute_master_setup, causing takeover to succeed but never run setup or return early, eventually timing out with LostMaster. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Summary
Problem
When the master worker is killed before completing queue setup:
master-statusstays at'setup'(until 8-hour TTL expires)LostMasterSolution
push()prevents split-brain (old master can't push after replacement)Key Changes
Configuration (
lib/ci/queue/configuration.rb)master_setup_heartbeat_interval(default: 5s)master_setup_heartbeat_timeout(default: 30s)Worker (
lib/ci/queue/redis/worker.rb)with_master_setup_heartbeat- heartbeat thread during setupattempt_master_takeover- calls Lua scriptrun_master_setup- runs setup after takeoverpush()Base (
lib/ci/queue/redis/base.rb)wait_for_master()now checks for stale heartbeatmaster_setup_heartbeat_stale?helperLua Script (
redis/takeover_master.lua)Test plan
🤖 Generated with Claude Code